Finetune Text Classification (Generative Models)

Synopsis

Finetunes an LLM for Text Classification tasks

Description

Finetunes a Large Language Model (LLM) for Text Classification tasks. The foundation model must first be downloaded from Huggingface into a model directory, and that directory must be provided as an input to this operator alongside the training data. You can use the Download Model and / or Load Model operators for this. The training data set must have at least two columns: an input column and a target column. The resulting model is stored in a directory of your project (recommended), in your file system, or in a temporary location if you choose to do so. Please note that only specific models can be used for Text Classification tasks; these are marked as such on Huggingface. Since finetuning is a complex topic, we recommend reading the documentation here: https://docs.rapidminer.com/latest/studio/generative-ai/#finetuning-a-model

Input

  • data (Data Table)

    The training data for the finetuning task. Needs at least an input column as well as a column with the desired outcomes.

  • model (File)

    The model directory for the foundation model you are using for this finetuning job.
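
As an illustration of the expected shape of the training data, the following minimal Python sketch checks that a table contains an input column and a target column and lists the classes found in the target column. The column names "text" and "sentiment", the sample rows, and the helper check_training_table are hypothetical and only serve this example; they are not part of the operator.

```python
import csv
import io

# Hypothetical column names; they play the role of the operator's
# input_column and target_column parameters.
INPUT_COLUMN = "text"
TARGET_COLUMN = "sentiment"

# Tiny made-up sample, only to show the expected table shape.
SAMPLE_CSV = """text,sentiment
"The plot was gripping from start to finish.",positive
"I want my two hours back.",negative
"Great cast, weak script.",negative
"""


def check_training_table(csv_text, input_column, target_column):
    """Return the distinct target classes, or raise if a column is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {input_column, target_column} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return sorted({row[target_column] for row in reader})


classes = check_training_table(SAMPLE_CSV, INPUT_COLUMN, TARGET_COLUMN)
print(classes)
```

Any additional columns in the table are simply not used by this check; only the two named columns matter.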

Output

  • model (File)

    The model directory in which the finetuned result has been stored: a folder in one of your projects, a folder in your file system, or a temporary folder, depending on the storage_type parameter.

Parameters

  • storage_type Determines where the finetuned model will be stored. Either in a folder in one of your projects / repositories (recommended), in a folder of your file system, or in a temporary folder. Range:
  • project_folder The folder in a project / repository to store the finetuned model in. Range:
  • file_folder The folder in your file system to store the finetuned model in. Range:
  • input_column The name of the attribute or column which should be used as input for the fine-tuning. Range:
  • target_column The name of the attribute or column which should be used as the target for this fine-tuning. Since this is a classification task, the model will try to learn how to predict the class values in the target column from the values in the input column. Range:
  • max_input_tokens The maximum number of tokens allowed for the inputs. Longer sequences will be ignored. Range:
  • epochs The number of epochs for this fine-tuning. Range:
  • device Where the finetuning should take place. Either on a GPU, a CPU, or Apple’s MPS architecture. If set to Automatic, the training will prefer the GPU if available and will fall back to CPU otherwise. Range:
  • device_indices If you have multiple GPUs and computation is set up to happen on GPUs you can specify which ones are used with this parameter. Counting of devices starts with 0. The default of “0” means that the first GPU device in the system will be used, a value of “1” would refer to the second and so on. You can utilize multiple GPUs by providing a comma-separated list of device indices. For example, you could use “0,1,2,3” on a machine with four GPUs if all four should be utilized. Please note that RapidMiner performs data-parallel computation which means that the model needs to be small enough to be completely loaded on each of your GPUs. Range:
  • finetuning_mode Indicates if a full finetuning is performed or PEFT / LoRA which can dramatically accelerate the finetuning task. Range:
  • lora_r The dimension of the low-rank matrices used by LoRA. Range:
  • lora_alpha The scaling factor for the weight matrices used by LoRA. Range:
  • lora_dropout The dropout probability of the LoRA layers. Range:
  • target_modules_mode If set to None, no specific definition is made for which modules (or layers) should be finetuned with PEFT / LoRA. This is the best setting for all the models which are natively supported by PEFT. If set to Automatic, we will extract the names of all linear layers automatically which is the recommended approach. And if set to Manual, you can specify a comma-separated list of target module names yourself. Range:
  • target_modules Only shown if the target module mode is set to Manual. You can specify here a comma-separated list of target module names. Those modules would be finetuned with PEFT / LoRA then. You can see the structure of the model including the module names in the logs. Range:
  • quantization Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types like 8-bit or 4-bit integers. Range:
  • 16_bit_precision Whether to use 16-bit (mixed) precision training (fp16) instead of 32-bit training. Range:
  • prep_threads The number of parallel threads used for the data preprocessing. Range:
  • batch_size The batch size for this fine-tuning. The number of GPUs x batch size x gradient accumulation steps should usually be a multiple of 8. Range:
  • gradient_accumulation_steps The gradient accumulation steps used for this fine-tuning. The number of GPUs x batch size x gradient accumulation steps should usually be a multiple of 8. Range:
  • train_test_ratio The fraction of rows which is held out for testing the fine-tuned model. Range:
  • learning_rate The learning rate for this fine-tuning. Range:
  • conda_environment The conda environment used for this model task. Additional packages may be installed into this environment. Please refer to the extension documentation for additional details on this and on version requirements for Python and some packages which have to be present in this environment. Range:
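
The arithmetic behind some of the parameters above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's implementation: the function names are hypothetical, and the 768 x 768 layer size is an arbitrary example value. The first two functions compute the product "number of GPUs x batch size x gradient accumulation steps" from a device_indices string, and the third counts the parameters that LoRA trains for a single linear layer with rank lora_r, compared with the full weight matrix.

```python
def parse_device_indices(device_indices):
    """Turn a string such as "0,1,2,3" into a list of GPU indices."""
    return [int(part) for part in device_indices.split(",") if part.strip()]


def effective_batch_size(device_indices, batch_size, gradient_accumulation_steps):
    """Number of GPUs x batch size x gradient accumulation steps."""
    num_gpus = len(parse_device_indices(device_indices))
    return num_gpus * batch_size * gradient_accumulation_steps


def lora_trainable_params(d_out, d_in, r):
    """Parameters LoRA trains for one d_out x d_in linear layer.

    LoRA trains two low-rank matrices, B (d_out x r) and A (r x d_in),
    instead of the full d_out x d_in weight matrix.
    """
    return r * (d_out + d_in)


# Two GPUs, batch size 4, two accumulation steps.
total = effective_batch_size("0,1", batch_size=4, gradient_accumulation_steps=2)
print(total, "multiple of 8:", total % 8 == 0)  # 16 multiple of 8: True

# Assumed example layer of 768 x 768 with lora_r = 8.
full = 768 * 768
lora = lora_trainable_params(768, 768, r=8)
print(lora, "of", full, "parameters")  # 12288 of 589824 parameters
```

In this example, LoRA trains roughly 2% of the parameters of the layer, which is why PEFT / LoRA can be much faster and lighter on memory than a full finetuning.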

Tutorial Processes

Finetune a model for sentiment classification

This process downloads a foundation model from Huggingface and finetunes it to predict the sentiment of a text based on only 400 training examples. The finetuned model is then applied to some test examples and deleted afterwards. The original foundation model is also deleted. In practice, one would not delete the finetuned model after a single application, but we wanted to keep things clean for this tutorial. Please note that this process may run for many hours, depending on your hardware setup.